A METHOD AND SYSTEM FOR PREDICTING
THE BIOLOGICAL ACTIVITY, INCLUDING TOXICOLOGY AND TOXICITY,
OF SUBSTANCES

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to and incorporates herein by reference in its 
entirety United States Provisional Patent Application: No. 60/263,161 entitled "A Method 
And System For Predicting The Biological Activity, Including Toxicology And Toxicity, 
Of Substances," filed January 23, 2001.

BACKGROUND OF THE INVENTION 

Field of the Invention: 

The present invention relates generally to a system and method for predictively 
assessing the biological activity of a substance, and, more specifically, the toxicity and
toxicology of a substance, utilizing a multi-variate statistical analysis of multiple gene 
expression patterns in response to that substance. 

Description of the Related Art: 
At least 55,000 chemicals are presently produced in the United States and over
2,000 new chemicals are introduced into the market each year. Very few of these
chemicals have been comprehensively tested for acute or chronic toxicity. For example, 
less than 1 percent of commercial chemicals have undergone complete health hazard 
assessment. 



The Environmental Protection Agency ("EPA") has the authority to require
toxicological testing of a chemical prior to commercial production, but that authority is
rarely invoked. Less than 10 percent of new chemicals are subjected to detailed review by 
the EPA. In the interest of cost and speedy access to the market, the EPA often uses the 
toxicity of previously tested homologous compounds to gauge the toxicity of a new 
chemical. 

The potential toxicity of new drugs is monitored by the Food and Drug
Administration ("FDA"). For a New Drug Application (NDA), the FDA typically requires
a large battery of toxicity, carcinogenicity, mutagenicity and reproduction/fertility tests in
at least two species of live animals. These tests are required to last up to one year. The
costs involved in completing these tests are enormous. For example, a typical 90-day
exposure toxicity test in rats costs approximately $100,000. A two year toxicity test in 
rats costs approximately $800,000 (Casarett and Doull's Toxicology, 4th Edition, M. O. 
Amdur et al., eds. Pergamon Press, New York, New York, p. 37 (1991)). 

In addition, toxicity testing is a necessary and time-consuming part of the 
pharmaceutical drug development pipeline. A research tool that would allow for accurate 
predictions regarding the toxicity of a substance, such as a lead drug candidate, without 
conducting costly and time-consuming in vivo studies would greatly facilitate
pharmaceutical research. 

Besides cost, animal testing also presents disadvantages in terms of time, animal 
suffering and accuracy. Typical toxicity tests are divided into three stages: acute, short 
term and long term. Acute tests, which determine the LD50 of a compound (the dose at
which 50% of test animals are killed), require some 60-100 animals and a battery of tests
for determining LD50, dose-response curves and for monitoring clinical end points other
than death. Short term tests usually involve at least 24 dogs and 90 rats and last from 90
days in rats to 6-24 months in dogs. Body weight, food consumption, blood, urine and
tissue samples are frequently measured in the short-term tests. In addition, dead animals
are subjected to post-mortem examinations. Long term tests are similar to short term
tests, but last 2 years in rats and up to 7 years in dogs or monkeys.

Animal testing has come under criticism by animal rights activists and the general
public because of the severe suffering inflicted on the animals. Moreover, recent evidence
calls into question the accuracy of animal testing. For example, variables, such as animal
diet, may impair the predictability of animal tests in determining carcinogenic properties.
P. H. Abelson, "Diet and Cancer in Humans and Rodents", Science, 255, p. 141 (1992).
Prior determinations on dioxin toxicity, based on guinea pig testing, are now being
reevaluated. B. J. Culliton, "U.S. Government Orders New Look At Dioxin", Nature,
352, p. 753 (1991); L. Roberts, "More Pieces in the Dioxin Puzzle", Research News,
October, 1991, p. 377. It is therefore apparent that there is an urgent need for a quick,
inexpensive and reliable alternative to toxicity testing in animals.

Several short-term alternative tests are available. For example, the Ames Assay
detects carcinogens which cause genetic reversion of mutant strains of Salmonella
typhimurium.

U.S. Patent No. 5,736,35, issued to Fielden, et al., discloses a method of
determining the toxicity of a fluid sample comprising mixing the sample with a
suspension of light emitting organisms; monitoring the light output of the mixture
continually over a period of time; and providing an assessment of toxicity based on
changes in light transmission.

U.S. Patent No. 5,702,915, issued to Miyamoto, discloses a biosensor for 
detecting the toxicity of a sample which includes a solid-state area image pickup element, 
a culture container positioned on an upper surface of a light-receiving portion of the 
element, a cell cultured in the culture container, and culture medium for growing the cell. 

U.S. Patent No. 5,589,337, issued to Farr, discloses diagnostic kits for 
determining the toxicity of a compound employing a plurality of bacterial hosts, each of
which harbors a DNA sequence encoding a different stress promoter fused to a gene 
which encodes an assayable product. 

U.S. Patent No. 5,569,580, issued to Young, discloses a method for the in vitro 
testing of chemicals to determine toxicity using hyperactivated rabbit spermatozoa.

U.S. Patent No. 6,160,105, issued to Cunningham, et al., discloses methods for 
screening compounds for toxicological responses employing a composition comprising a
plurality of polynucleotide targets used as hybridizable array elements in a microarray. 

However, these assays suffer from a significant shortcoming in that none of these
tests permit a predictive assessment of the biological activity, toxicology, and toxicity of
a substance.

As examples of substances with toxic effects, carbon tetrachloride (CCl4), which
causes hepatitis, when introduced into liver cells of a mature rat, produces a change of
cell morphology and a leak-out of enzymes such as glutamic-pyruvic transaminase (GPT),
glutamic-oxaloacetic transaminase (GOT) and lactate dehydrogenase (LDH). Based on
this fact, there has been proposed a possibility of detecting hepatotoxin.

Benzo(a)pyrene is a known rodent and likely human carcinogen and is the
prototype of a class of compounds, the polycyclic aromatic hydrocarbons. It is
metabolized by several forms of cytochrome P450 and associated enzymes to both
activated and detoxified metabolites (Degawa et al. (1994) Cancer Res. 54: 4915-4919).
The ultimate metabolites are the bay-region diol epoxide, benzo(a)pyrene-7,8-diol-9,10-
epoxide (BPDE) and the K-region diol epoxide, 9-hydroxy benzo(a)pyrene-4,5-oxide,
which have been shown to cause DNA adduct formation (alkylation of guanine bases).
DNA adducts have been shown to persist in rat liver up to 56 days following treatment
with benzo(a)pyrene at a dose of 10 mg/kg body weight 3 times per week for 2 weeks (Qu 
and Stacey, (1996) Carcinogenesis 17: 53-59). 

Acetaminophen is a widely-used analgesic. It is metabolized by specific 
cytochrome P450 isozymes with the majority of the drug undergoing detoxification by
glucuronic acid, sulfate and glutathione conjugation pathways (Chen et al. (1998) Chem.
Res. Toxicol. 11: 295-301). However, at high non-therapeutic doses, acetaminophen can
cause hepatic and renal failure by being metabolized to an active intermediate, N-acetyl-
p-benzoquinone imine (NAPQI). NAPQI then binds to sulfhydryl groups of proteins
causing their inactivation and leading to subsequent cell death (Kroger et al. (1997) Gen.
Pharmacol. 28: 257-263).

Clofibrate is an antilipidemic drug which lowers elevated levels of serum 
triglycerides. In rodents, chronic treatment produces hepatomegaly and an increase in hepatic
peroxisomes (Lock et al. (1989) Ann. Rev. Pharmacol. Toxicol. 29: 145-163). Clofibrate
has been shown to increase levels of cytochrome P450 4A and reduce the levels of P450
4F (Kawashima et al. (1997) Arch. Biochem. Biophys. 347: 148-154). It is also involved
in transcription of β-oxidation genes as well as induction of peroxisome proliferator
activated receptors (Kawashima, supra).

Thus, there remains a need for an efficient and effective system and method for
predictively assessing the biological activity of a substance, and, more specifically, the
toxicity and toxicology of a substance, utilizing a multi-variate statistical analysis of 
multiple gene expression patterns in response to that substance. 

BRIEF SUMMARY OF THE INVENTION

It is a feature and advantage of the present invention to provide an improved 
system and method for predictively assessing the biological activity of a substance. 

It is a further feature and advantage of the present invention to provide an 
improved system and method for predictively assessing the toxicology of a substance. 

It is a further feature and advantage of the present invention to provide an 
improved system and method for predictively assessing the toxicity of a substance. 

To achieve the stated and other features, advantages and objects, an embodiment 
of the present invention provides an improved system and method for predictively 
assessing the biological activity of a substance, and, more specifically, the toxicity and 
toxicology of a substance, utilizing a multi-variate statistical analysis of multiple gene
expression patterns in response to that substance. 

This system and method employs the use of gene expression microarrays. For 
example, microarrays consisting of full length genes or gene fragments on a substrate may 
be formed. These arrays can then be tested with samples treated with a substance to
elucidate the gene expression pattern associated with treatment with the substance. This
gene pattern can be compared with gene expression patterns of compounds associated
with known toxicological responses. 

The present invention also provides systems and methods for the screening,
preferably in a microarray format, of compounds and therapeutic treatments for
toxicological effects.

Additional objects, advantages and novel features of the invention will be set forth 
in part in the description which follows, and in part will become more apparent to those 
skilled in the art upon examination of the following, or may be learned by practice of the 
invention.

BRIEF DESCRIPTION OF THE DRAWINGS 
Figures 1a, 1b, 1c, and 1d present four preferred patterns for illustrating the
response of a gene or set of genes to a chemical. 

Figure 2 presents the principal component analysis of the CCl4 data.
Figure 3 presents the principal component analysis of the APAP data.

Figure 4 presents the APAP predictive similarity model. 
Figure 5 presents the CCl4 predictive similarity model.

DETAILED DESCRIPTION OF THE INVENTION

The present invention pertains to the development of a method for assessing the
toxicity and toxicology of a substance. In one preferred embodiment of the present
invention, for each study, one derives a predictive model relating gene expression to
toxicity such that it can be used to screen compounds. One then compares and cross-
validates various models with other toxicological studies so as to refine the models.

It will be appreciated that in such a study, one relies upon various study designs.
These, preferably, include time (one or more time points); treatment (one or more doses);
and vehicle (which may differ from study to study).

In a preferred embodiment of the present invention a minimum of three animals
are tested per group. 

It will be further appreciated that treatments related to one or more toxic pathways 
may be explored, which treatments may differ from study to study. 

An aspect of the present invention is an analysis of the variance for each gene 
contrast analysis. In this gene contrast analysis, the response of a gene or set of genes is
monitored upon exposure to a chemical. In one preferred embodiment, the response of a
gene or set of genes to a chemical can be fitted into one of four patterns illustrated in
Figures 1a, 1b, 1c, and 1d. In this preferred embodiment, upon classification into one of
these four groups, an analysis is then performed which categorizes the gene contrast
analysis as one of four summary scores. These summary scores are then subjected to
logistic regression analysis, furnishing a predictive model.

In another preferred embodiment of the present invention, the input data for the 
analysis of the variance for each gene contrast analysis is the average difference for all 
samples and all genes. In yet another preferred embodiment of the present invention, the 
analysis fits two factors (for example, time and dose) in an analysis of variance 
(ANOVA) methodology, using contrast analysis to assign each gene to a pattern. In still
another preferred embodiment, the output comprises a correlation of a list of patterns and
a list of genes within each pattern, coupled with a measure of the fit.
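
The pattern-assignment step can be sketched as follows. This is a minimal illustration only, assuming a single factor (four ordered dose groups) rather than the two-factor time and dose ANOVA described above. The four contrast templates, the function names, and the use of a correlation as the "measure of the fit" are all assumptions made for the sketch; the templates are not the patterns of Figures 1a to 1d.

```python
from math import sqrt

# Hypothetical contrast templates over four ordered dose groups
# (vehicle, low, mid, high). Each sums to zero, as a contrast must.
# Illustrative only: these are NOT the patterns of Figures 1a-1d.
TEMPLATES = {
    "linear_up": [-3, -1, 1, 3],
    "linear_down": [3, 1, -1, -3],
    "high_dose_only_up": [-1, -1, -1, 3],
    "high_dose_only_down": [1, 1, 1, -3],
}


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is flat)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def assign_pattern(group_means):
    """Return (template_name, fit) for one gene's per-group mean expression."""
    fits = {name: pearson(group_means, c) for name, c in TEMPLATES.items()}
    best = max(fits, key=fits.get)
    return best, fits[best]


# Made-up group means for two genes.
print(assign_pattern([100, 150, 200, 250]))  # steadily increasing with dose
print(assign_pattern([100, 100, 100, 400]))  # responds at the top dose only
```

A gene whose best fit is weak (for example, an unresponsive gene, for which every fit is 0.0) would in practice be left unassigned by applying a cutoff to the returned fit.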

In still another preferred embodiment of the present invention, responses of a gene 
or set of genes to a chemical that fit into patterns corresponding to either Figures 1a or 1b
are subjected to analysis which categorizes the gene contrast analysis as one of four
summary scores. In such an embodiment, the input data are genes selected from patterns
that are biologically relevant to the toxicological process; the analysis is performed for all 
samples on selected genes; and the output data comprises summary scores for each 
sample. 

In a further preferred aspect of this embodiment, the summary scores are subjected 
to logistic regression analysis, resulting in a predictive model. In this aspect of the 
embodiment, the input data are the summary scores per sample, which is an indicator for
each sample; the analysis is a logistic regression analysis mapping the summary scores to
a 0 to 1 scale of toxicity; and the output data are one or more mathematical formulae
that convert a column of average differences into a single 0 to 1 toxicological score for a
sample. 
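
By way of illustration, the mapping of a summary score onto a 0 to 1 scale may be sketched as a one-variable logistic regression. The sketch below is not the fitted model of any embodiment: the training scores and labels are made up, and the function names, learning rate and iteration count are assumptions chosen for a small, dependency-free example.

```python
from math import exp


def sigmoid(z):
    """Numerically stable logistic function, returning a value in (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    ez = exp(z)
    return ez / (1.0 + ez)


def fit_logistic(scores, labels, lr=0.1, epochs=5000):
    """Fit p(toxic) = sigmoid(b0 + b1 * score) by batch gradient ascent
    on the log-likelihood. labels are 0 (non-toxic) or 1 (toxic)."""
    b0, b1 = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(scores, labels):
            err = y - sigmoid(b0 + b1 * x)
            g0 += err
            g1 += err * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


def toxicity_score(score, b0, b1):
    """Map one sample's summary score to the 0 to 1 toxicity scale."""
    return sigmoid(b0 + b1 * score)


# Made-up summary scores per sample and their known toxic/non-toxic labels.
train_scores = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
train_labels = [0, 0, 1, 0, 1, 1]
b0, b1 = fit_logistic(train_scores, train_labels)
print(toxicity_score(2.0, b0, b1), toxicity_score(-2.0, b0, b1))
```

The fitted pair (b0, b1) plays the role of the "mathematical formulae" referred to above: once fitted on known compounds, it converts the summary score of a new sample into a single value between 0 and 1.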

It will be appreciated that another preferred aspect of the present invention is an 
assessment of false positive and false negative rates so as to test the validity of the 
predictive model. 
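
The assessment of false positive and false negative rates may be sketched as follows, assuming that each sample has a 0 to 1 toxicological score and a known toxic or non-toxic status. The 0.5 decision threshold, the function name and the example values are assumptions for illustration only.

```python
def error_rates(scores, is_toxic, threshold=0.5):
    """Return (false_positive_rate, false_negative_rate).

    A sample is called toxic when its score is at or above `threshold`.
    The false positive rate is taken over truly non-toxic samples and the
    false negative rate over truly toxic samples.
    """
    tp = fp = tn = fn = 0
    for score, toxic in zip(scores, is_toxic):
        called_toxic = score >= threshold
        if called_toxic and toxic:
            tp += 1
        elif called_toxic and not toxic:
            fp += 1
        elif not called_toxic and toxic:
            fn += 1
        else:
            tn += 1
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr


# Made-up model scores: three known toxic samples, then three known non-toxic.
scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]
truth = [True, True, True, False, False, False]
print(error_rates(scores, truth))
```

In this made-up example one of three non-toxic samples is scored above the threshold and one of three toxic samples below it, so both rates are one third.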

Another aspect of the present invention is the correlation of a predictive model 
with results obtained from other studies. Thus, preferably, one seeks validation of each
model with vehicles and toxins from other models. In this mode, non-similar toxins
should score low; similar toxins should score high; and vehicles should score low
regardless of vehicle type. 

In correlating these other studies, one preferably compares gene lists for patterns of
interest between studies of related compounds to arrive at a consensus set of genes
involved in a toxicological response.
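
One simple way to arrive at such a consensus set is to keep the genes that recur across the gene lists of the individual studies. The sketch below assumes that each study contributes a list of gene identifiers for the pattern of interest; the study names, gene identifiers and the `min_studies` parameter are hypothetical.

```python
from collections import Counter


def consensus_genes(gene_lists_by_study, min_studies=None):
    """Return the sorted genes found in at least `min_studies` studies.

    gene_lists_by_study maps a study name to an iterable of gene IDs.
    By default a gene must appear in every study (strict intersection).
    """
    if min_studies is None:
        min_studies = len(gene_lists_by_study)
    counts = Counter()
    for genes in gene_lists_by_study.values():
        counts.update(set(genes))  # count each gene once per study
    return sorted(g for g, c in counts.items() if c >= min_studies)


# Hypothetical gene identifiers and study names, for illustration only.
studies = {
    "study_1": ["gene_A", "gene_B", "gene_C"],
    "study_2": ["gene_B", "gene_C", "gene_D"],
    "study_3": ["gene_C", "gene_D", "gene_E"],
}
print(consensus_genes(studies))                 # genes present in all studies
print(consensus_genes(studies, min_studies=2))  # genes present in two or more
```

Relaxing `min_studies` trades a smaller, stricter marker set for a larger one that tolerates a gene being missed in a single study.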

In another preferred embodiment of the present invention, the goal of the method
for assessing the toxicity and toxicology of a substance is to use gene expression to
predict whether a compound has a high probability of being toxic at a given dose. In this
preferred embodiment, patterns of gene expression can be compared against known "toxic"
patterns and a similarity score calculated. Preferably, the methodology associated with
this preferred embodiment includes identification of gene expression patterns associated
with toxicity; quantification of this association; development of a statistical inference of 
similarity; and validation of results. 

It will be appreciated that in such modeling, there can be a number of different
types of markers, including general markers, group markers (for example, cholestasis,
necrosis, stenosis), and compound specific markers. 

It will be appreciated that there are preferred model attributes. These include: time
stability (must be able to predict toxicity over an extended time range); dose dependency
(should only score toxic doses of compounds); vehicle independence (should not be
sensitive to type of vehicle used); predictable (based on statistical inference with known
false positive rate); and powerful (false negative rates should be low enough that
singletons or low number of replicates can adequately predict toxicity).

In another preferred embodiment of the present invention, there are various stages
of model development. These, preferably, include: selection (determination of relevant
expression patterns that are time stable and dose dependent); quantification (production 
of composite measures that define patterns); prediction (use of composite measures to 
assign probability of patterns being the same); and validation (ability to provide statistical 
measures of model accuracy). 

It will be recognized that the present invention enables one to develop models for 
key compounds; cross-validate each model; identify false positives and false negatives;
provide positive crossover; reduce models to best set of toxic markers; and predict the 
toxicity of unknown compounds. 

The expression similarity profiling models for predictive toxicology are developed
based on the gene expression patterns of known toxic substances. The gene expression 
patterns of unknown chemicals are compared against these known patterns and a 
probability of similar toxic profile is produced. Recognizing these gene expression 
patterns and producing a single predictive score from thousands of individual
measurements involves the use of multiple established techniques in a non-obvious linear 
sequence. 

These techniques provide for selection of time-stable and dose-dependent toxic 
gene expression profiles via contrast analysis and reduction of thousands of variables into
one or more composite variables via principal component analysis (PCA). 
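
The reduction of many gene-level variables into a composite variable can be illustrated with a small, dependency-free sketch of the first principal component, computed here by power iteration on the sample covariance matrix. The function name, iteration count and toy data are assumptions; a real analysis over thousands of genes would ordinarily use an established numerical library.

```python
from math import sqrt


def first_principal_component(rows, iterations=200):
    """Return (loadings, scores) for the first principal component.

    rows is a list of samples, each a list of per-gene expression values.
    Power iteration starts from a uniform vector, so it assumes that the
    leading component is not orthogonal to that starting vector.
    """
    n, p = len(rows), len(rows[0])
    if n < 2:
        raise ValueError("need at least two samples")
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    v = [1.0 / sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # no variance in the data
        v = [x / norm for x in w]
    # One composite value per sample: projection onto the loadings.
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    return v, scores


# Toy data: four samples, two genes that rise together.
loadings, composite = first_principal_component([[1, 2], [2, 4], [3, 6], [4, 8]])
print(loadings)
print(composite)
```

Each sample's composite value is a single number summarizing its selected genes, which is the form of input a subsequent logistic regression step would take.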

Use of composite variables allows one to make a predictive composite measure 
via logistic regression. In addition, the present invention provides for validation of the
model by testing both known toxic and non-toxic substances using this composite
measure. 

The ability to tell whether a chemical compound has a high probability of being
toxic based on its gene expression profile is a critical issue for the safety of
potential pharmaceutical compounds.

The gene expression pattern caused by an unknown substance will be entered into
a series of formulas. These formulas will then predict the likelihood of toxicity on a 0 to
1 scale, 0 being the highest confidence in safety and 1 being the highest confidence in
toxicity.

In one aspect, the invention provides a method for screening a compound for a
toxicological effect. The method comprises selecting a plurality of polynucleotide targets,
wherein the polynucleotide targets have first gene expression levels altered in tissues
treated with known toxicological agents when compared with untreated tissues. Some of
the first gene expression levels may be upregulated and others downregulated when
associated with a toxicological response. A sample is treated with the compound to
induce second gene expression levels of a plurality of polynucleotide probes. Then first
and second gene expression levels are compared to identify those compounds that induce
expression levels of the polynucleotide probes that are similar to those of the
polynucleotide targets and the similarity of expression levels correlates with a
toxicological effect of the compound.

Preferred tissues are selected from the group consisting of liver, kidney, brain,
spleen, pancreas and lung. Preferred toxicological agents are acetaminophen and other
compounds with a similar mechanism of action.

Alternatively, the invention provides methods for screening a therapeutic
treatment for a toxicological effect or for screening a sample for a toxicological response
to a compound or therapeutic treatment.

In another aspect, the invention provides methods for preventing a toxicological 
response by administering complementary nucleotide sequences against one or more
selected upregulated polynucleotide targets or a ribozyme that specifically cleaves such 
sequences. Alternatively, a toxicological response may be prevented by administering 
sense nucleotide sequences for one or more selected downregulated polynucleotide 
targets. 

In yet another aspect, the invention provides methods for preventing a
toxicological response by administering an agonist which initiates transcription of a gene
comprising a downregulated polynucleotide of the invention. Alternatively, a 
toxicological response may be prevented by administering an antagonist which prevents 
transcription of a gene comprising an upregulated polynucleotide of the invention. 

Oligonucleotide probes have long been used to detect complementary nucleic acid
sequences in a nucleic acid of interest (the "target" nucleic acid) and have been used to
detect expression of particular genes (e.g., a Northern Blot). In some assay formats, the
oligonucleotide probe is tethered, i.e., by covalent attachment, to a solid support, and
arrays of oligonucleotide probes immobilized on solid supports have been used to detect
specific nucleic acid sequences in a target nucleic acid. See, e.g., PCT patent publication
Nos. WO 89/10977 and 89/11548. Others have proposed the use of large numbers of
oligonucleotide probes to provide the complete nucleic acid sequence of a target nucleic
acid but failed to provide an enabling method for using arrays of immobilized probes for 
this purpose. See U.S. Pat. Nos. 5,202,231 and 5,002,867 and PCT patent publication No.
WO 93/17126. 

The use of "traditional" hybridization protocols for monitoring or quantifying 
gene expression is problematic. For example, two or more gene products of approximately
the same molecular weight will prove difficult or impossible to distinguish in a Northern
blot because they are not readily separated by electrophoretic methods. Similarly, as
hybridization efficiency and cross-reactivity vary with the particular subsequence
(region) of a gene being probed it is difficult to obtain an accurate and reliable measure of 
gene expression with one, or even a few, probes to the target gene. 

The development of VLSIPS technology provided methods for synthesizing arrays
of many different oligonucleotide probes that occupy a very small surface area. See U.S.
Pat. No. 5,143,854 and PCT No. WO 90/15070. U.S. Patent Application Ser. No.
082,937, filed Jun. 25, 1993, describes methods for making arrays of oligonucleotide
probes that can be used to provide the complete sequence of a target nucleic acid and to
detect the presence of a nucleic acid containing a specific nucleotide sequence.

Prior to the present invention, however, it was unknown that high density
oligonucleotide arrays could be used to reliably monitor message levels of a multiplicity
of preselected genes in the presence of a large abundance of other (non-target) nucleic
acids (e.g., in a cDNA library, DNA reverse transcribed from an mRNA, mRNA used
directly or amplified, or polymerized from a DNA template). In addition, the prior art
provided no rapid and effective method for identifying a set of oligonucleotide probes
that maximize specific hybridization efficacy while minimizing cross-reactivity nor of
using hybridization patterns (in particular hybridization patterns of a multiplicity of
oligonucleotide probes in which multiple oligonucleotide probes are directed to each
target nucleic acid) for quantification of target nucleic acid concentrations. 

The present invention is premised, in part, on the discovery that microfabricated
arrays of large numbers of different oligonucleotide probes (DNA chips) may effectively
be used to not only detect the presence or absence of target nucleic acid sequences, but to
quantify the relative abundance of the target sequences in a complex nucleic acid pool. In
particular, prior to this invention it was unknown that hybridization to high density probe
arrays would permit small variations in expression levels of a particular gene to be
identified and quantified in a complex population of nucleic acids that outnumber the
target nucleic acids by 1,000 fold to 1,000,000 fold or more.

Thus, this invention employs a method of simultaneously monitoring the
expression (e.g. detecting and/or quantifying the expression) of a multiplicity of genes.
The levels of transcription for virtually any number of genes may be determined
simultaneously. Typically, at least about 10 genes, preferably at least about 100, more
preferably at least about 1,000 and most preferably at least about 10,000 different genes
are assayed at one time.

The method involves providing a pool of target nucleic acids comprising mRNA
transcripts of one or more of said genes, or nucleic acids derived from the mRNA
transcripts; hybridizing the pool of nucleic acids to an array of oligonucleotide probes
immobilized on a surface, where the array comprises more than 100 different
oligonucleotides, each different oligonucleotide is localized in a predetermined region of
said surface, the density of the different oligonucleotides is greater than about 60 different
oligonucleotides per 1 cm², and the oligonucleotide probes are complementary to the
mRNA transcripts or nucleic acids derived from the mRNA transcripts; and quantifying
the hybridized nucleic acids in the array. In a preferred embodiment, the pool of target
nucleic acids is one in which the concentration of the target nucleic acids (mRNA
transcripts or nucleic acids derived from the mRNA transcripts) is proportional to the
expression levels of genes encoding those target nucleic acids.

In a preferred embodiment, the array of oligonucleotide probes is a high density
array comprising greater than about 100, preferably greater than about 1,000, more
preferably greater than about 16,000 and most preferably greater than about 65,000 or
250,000 or even 1,000,000 different oligonucleotide probes. Such high density arrays
comprise a probe density of generally greater than about 60, more generally greater than
about 100, most generally greater than about 600, often greater than about 1000, more
often greater than about 5,000, most often greater than about 10,000, preferably greater
than about 40,000, more preferably greater than about 100,000, and most preferably
greater than about 400,000 different oligonucleotide probes per cm². The oligonucleotide
probes range from about 5 to about 50 nucleotides, more preferably from about 10 to
about 40 nucleotides and most preferably from about 15 to about 40 nucleotides in length.
The array may comprise more than 10, preferably more than 50, more preferably more
than 100, and most preferably more than 1,000 oligonucleotide probes specific for each
target gene. Although a planar array surface is preferred, the array may be fabricated on a
surface of virtually any shape or even a multiplicity of surfaces.

The array may further comprise mismatch control probes. Where such mismatch
controls are present, the quantifying step may comprise calculating the difference in
hybridization signal intensity between each of the oligonucleotide probes and its
corresponding mismatch control probe. The quantifying may further comprise calculating
the average difference in hybridization signal intensity between each of the
oligonucleotide probes and its corresponding mismatch control probe for each gene.
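
The average difference calculation just described can be sketched for a single gene as follows. The function name and the intensity values are made up for illustration; any background subtraction, outlier trimming or normalization is outside the scope of the sketch.

```python
def average_difference(probe_intensities, mismatch_intensities):
    """Average, over one gene's probe pairs, of the difference between the
    signal of each test probe and that of its mismatch control probe."""
    if len(probe_intensities) != len(mismatch_intensities):
        raise ValueError("each probe needs exactly one mismatch control")
    if not probe_intensities:
        raise ValueError("at least one probe pair is required")
    diffs = [p - m for p, m in zip(probe_intensities, mismatch_intensities)]
    return sum(diffs) / len(diffs)


# Made-up hybridization intensities for four probe pairs of one gene.
probes = [1200, 900, 1500, 1100]
mismatches = [300, 400, 500, 200]
print(average_difference(probes, mismatches))  # 825.0
```

Applying this to every gene on an array yields the per-sample column of average differences that the statistical steps of the method take as input.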

The probes present in the high density array can be oligonucleotide probes
selected according to the optimization methods described below. Alternatively, non-
optimal probes may be included in the array, but the probes used for quantification
(analysis) can be selected according to the optimization methods described below. 

Oligonucleotide arrays for the practice of this invention are preferably synthesized 
by light-directed very large scale immobilized polymer synthesis (VLSIPS) as described
herein. The array includes test probes which are oligonucleotide probes each of which has
a sequence that is complementary to a subsequence of one of the genes (or the mRNA or
the corresponding antisense cRNA) whose expression is to be detected. In addition, the
array can contain normalization controls, mismatch controls and expression level controls 
as described herein. 

The pool of nucleic acids may be labeled before, during, or after hybridization,
although in a preferred embodiment, the nucleic acids are labeled before hybridization.
Fluorescence labels are particularly preferred and, where used, quantification of the
hybridized nucleic acids is by quantification of fluorescence from the hybridized
fluorescently labeled nucleic acid. Such quantification is facilitated by the use of a
fluorescence microscope which can be equipped with an automated stage to permit
automatic scanning of the array, and which can be equipped with a data acquisition
system for the automated measurement, recording and subsequent processing of the
fluorescence intensity information.

In a preferred embodiment, hybridization is at low stringency (e.g., about 20° C.
to about 50° C., more preferably about 30° C. to about 40° C., and most preferably about
37° C. and 6X SSPE-T or lower) with at least one wash at higher stringency.
Hybridization may include subsequent washes at progressively increasing stringency until
a desired level of hybridization specificity is reached.

The pool of target nucleic acids can be the total polyA+ mRNA isolated from
a biological sample, or cDNA made by reverse transcription of the RNA or second strand
cDNA or RNA transcribed from the double stranded cDNA intermediate. Alternatively,
the pool of target nucleic acids can be treated to reduce the complexity of the sample and
thereby reduce the background signal obtained in hybridization. In one approach, a pool
of mRNAs, derived from a biological sample, is hybridized with a pool of
oligonucleotides comprising the oligonucleotide probes present in the high density array.
The pool of hybridized nucleic acids is then treated with RNase A which digests the
single stranded regions. The remaining double stranded hybridization complexes are then
denatured and the oligonucleotide probes are removed, leaving a pool of mRNAs
enhanced for those mRNAs complementary to the oligonucleotide probes in the high 
density array. 

In another approach to background reduction, a pool of mRNAs derived from a
biological sample is hybridized with paired target specific oligonucleotides where the
paired target specific oligonucleotides are complementary to regions flanking
subsequences of the mRNAs complementary to the oligonucleotide probes in the high
density array. The pool of hybridized nucleic acids is treated with RNase H which digests
the hybridized (double stranded) nucleic acid sequences. The remaining single stranded
nucleic acid sequences which have a length about equivalent to the region flanked by the
paired target specific oligonucleotides are then isolated (e.g., by electrophoresis) and used
as the pool of nucleic acids for monitoring gene expression.

Finally, a third approach to background reduction involves eliminating or
reducing the representation in the pool of particular preselected target mRNA messages
(e.g., messages that are characteristically overexpressed in the sample). This method
involves hybridizing an oligonucleotide probe that is complementary to the preselected
target mRNA message to the pool of polyA+ mRNAs derived from a biological
sample. The oligonucleotide probe hybridizes with the particular preselected polyA+
mRNA (message) to which it is complementary. The pool of hybridized nucleic acids is
treated with RNase H which digests the double stranded (hybridized) region thereby
separating the message from its polyA+ tail. Isolating or amplifying (e.g., using an
oligo dT column) the polyA+ mRNA in the pool then provides a pool having a
reduced or no representation of the preselected target mRNA message.

It will be appreciated that the methods of this invention can be used to monitor
(detect and/or quantify) the expression of any desired gene of known sequence or
subsequence. Moreover, these methods permit monitoring expression of a large number
of genes simultaneously and effect significant advantages in reduced labor, cost and time.
The simultaneous monitoring of the expression levels of a multiplicity of genes permits
effective comparison of relative expression levels and identification of biological
conditions characterized by alterations of relative expression levels of various genes.
Genes of particular interest for expression monitoring include genes involved in the
pathways associated with various pathological conditions (e.g., cancer) and whose
expression is thus indicative of the pathological condition. Such genes include, but are
not limited to, the HER2 (c-erbB-2/neu) proto-oncogene in the case of breast cancer,
receptor tyrosine kinases (RTKs) associated with the etiology of a number of tumors
including carcinomas of the breast, liver, bladder, pancreas, as well as glioblastomas,
sarcomas and squamous carcinomas, and tumor suppressor genes such as the P53 gene
and other "marker" genes such as RAS, MSH2, MLH1 and BRCA1. Other genes of
particular interest for expression monitoring are genes involved in the immune response
(e.g., interleukin genes), as well as genes involved in cell adhesion (e.g., the integrins or
selectins) and signal transduction (e.g., tyrosine kinases), etc.

In another embodiment, this invention provides for a method of selecting a set of
oligonucleotide probes that specifically bind to a target nucleic acid (e.g., a gene or genes
whose expression is to be monitored or nucleic acids derived from the gene or its
transcribed mRNA). The method involves providing a high density array of
oligonucleotide probes where the array comprises a multiplicity of probes wherein each
probe is complementary to a subsequence of the target nucleic acid. The target nucleic
acid is then hybridized to the array of oligonucleotide probes to identify and select those
probes where the difference in hybridization signal intensity between each probe and its
mismatch control is detectable (preferably greater than about 10% of the background
signal intensity, more preferably greater than about 20% of the background signal
intensity and most preferably greater than about 50% of the background signal intensity).
The method can further comprise hybridizing the array to a second pool of nucleic acids
comprising nucleic acids other than the target nucleic acids; and identifying and selecting
probes having the lowest hybridization signal and where both the probe and its mismatch
control have a hybridization intensity equal to or less than about 5 times the background
signal intensity, preferably equal to or less than about 2 times the background signal
intensity, more preferably equal to or less than about 1 times the background signal
intensity, and most preferably equal to or less than about half the background signal
intensity.
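
By way of illustration only, the first selection criterion above can be sketched as follows. The function name, the argument names and the example intensities are hypothetical and are not taken from this specification; the sketch simply keeps those probes whose perfect-match signal exceeds the mismatch-control signal by more than a chosen fraction of the background signal intensity.

```python
def select_probes(pm, mm, background, min_fraction=0.10):
    """Return indices of probes whose (probe - mismatch control) difference
    exceeds min_fraction * background.

    pm, mm: hypothetical lists of hybridization intensities for each probe
    and its mismatch control; background: background signal intensity.
    """
    threshold = min_fraction * background
    return [i for i, (p, m) in enumerate(zip(pm, mm)) if p - m > threshold]


# Made-up intensities, for illustration only.
pm = [500.0, 120.0, 105.0, 900.0]
mm = [200.0, 115.0, 104.0, 850.0]
preferred = select_probes(pm, mm, background=100.0, min_fraction=0.10)
most_preferred = select_probes(pm, mm, background=100.0, min_fraction=0.50)
```

Raising min_fraction from 0.10 to 0.20 or 0.50 corresponds to the more preferred and most preferred thresholds recited above.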

In a preferred embodiment, the multiplicity of probes can include every different
probe of length n that is complementary to a subsequence of the target nucleic acid. The
probes can range from about 10 to about 50 nucleotides in length. The array is preferably
a high density array as described above. Similarly, the hybridization methods, conditions,
times, fluid volumes, and detection methods are as described above and herein below.

In addition, this invention provides for a composition comprising an array of
oligonucleotide probes immobilized on a substrate, where the array comprises more than
100 different oligonucleotides and each different oligonucleotide is localized in a
predetermined region of the solid support and the density of the array is greater than about
60 different oligonucleotides per 1 cm² of substrate. The oligonucleotide probes are
specifically hybridized to one or more fluorescently labeled nucleic acids such that the
fluorescence in each region of the array is indicative of the level of expression of each of
a multiplicity of preselected genes. The array is preferably a high density array as
described above and may further comprise expression level controls, mismatch controls
and normalization controls as described herein.

Finally, this invention provides for kits for simultaneously monitoring expression
levels of a multiplicity of genes. The kits include an array of immobilized oligonucleotide
probes complementary to subsequences of the multiplicity of target genes, as described
above. In one embodiment, the array comprises at least 100 different oligonucleotide
probes and the density of the array is greater than about 60 different oligonucleotides per
1 cm² of surface. The kit may also include instructions describing the use of the array for
detection and/or quantification of expression levels of the multiplicity of genes. The kit
may additionally include one or more of the following: buffers, hybridization mix, wash
and read solutions, labels, labeling reagents (enzymes etc.), "control" nucleic acids,
software for probe selection, array reading or data analysis and any of the other materials
or reagents described herein for the practice of the claimed methods.

With regard to the present invention, the phrase "massively parallel screening"
refers to the simultaneous screening of at least about 100, preferably about 1000, more
preferably about 10,000 and most preferably about 1,000,000 different nucleic acid
hybridizations.

The terms "nucleic acid" or "nucleic acid molecule" refer to a deoxyribonucleotide
or ribonucleotide polymer in either single- or double-stranded form, and unless otherwise
limited, would encompass known analogs of natural nucleotides that can function in a
similar manner as naturally occurring nucleotides.

An oligonucleotide is a single-stranded nucleic acid ranging in length from 2 to
about 500 bases.

As used herein a "probe" is defined as an oligonucleotide capable of binding to a
target nucleic acid of complementary sequence through one or more types of chemical
bonds, usually through complementary base pairing, usually through hydrogen bond
formation. As used herein, an oligonucleotide probe may include natural (i.e., A, G, C, or
T) or modified bases (7-deazaguanosine, inosine, etc.). In addition, the bases in an
oligonucleotide probe may be joined by a linkage other than a phosphodiester bond, so
long as it does not interfere with hybridization. Thus, oligonucleotide probes may be
peptide nucleic acids in which the constituent bases are joined by peptide bonds rather
than phosphodiester linkages.

The term "target nucleic acid" refers to a nucleic acid (often derived from a
biological sample), to which the oligonucleotide probe is designed to specifically
hybridize. It is either the presence or absence of the target nucleic acid that is to be
detected, or the amount of the target nucleic acid that is to be quantified. The target
nucleic acid has a sequence that is complementary to the nucleic acid sequence of the
corresponding probe directed to the target. The term target nucleic acid may refer to the
specific subsequence of a larger nucleic acid to which the probe is directed or to the
overall sequence (e.g., gene or mRNA) whose expression level it is desired to detect. The
difference in usage will be apparent from context.

"Subsequence" refers to a sequence of nucleic acids that comprise a part of a 
longer sequence of nucleic acids. 

The term "complexity" is used here according to the standard meaning of this term as
established by Britten et al., Methods of Enzymol. 29:363 (1974). See also Cantor and
Schimmel, Biophysical Chemistry: Part III at 1228-1230 for further explanation of nucleic
acid complexity.

"Bind(s) substantially" refers to complementary hybridization between a probe
nucleic acid and a target nucleic acid and embraces minor mismatches that can be
accommodated by reducing the stringency of the hybridization media to achieve the
desired detection of the target polynucleotide sequence.

The phrase "hybridizing specifically to" refers to the binding, duplexing, or
hybridizing of a molecule only to a particular nucleotide sequence under stringent
conditions when that sequence is present in a complex mixture (e.g., total cellular DNA
or RNA). The term "stringent conditions" refers to conditions under which a probe will
hybridize to its target subsequence, but to no other sequences. Stringent conditions are
sequence-dependent and will be different in different circumstances. Longer sequences
hybridize specifically at higher temperatures. Generally, stringent conditions are selected
to be about 5° C. lower than the thermal melting point (Tm) for the specific sequence at a
defined ionic strength and pH. The Tm is the temperature (under defined ionic strength,
pH, and nucleic acid concentration) at which 50% of the probes complementary to the
target sequence hybridize to the target sequence at equilibrium. (As the target sequences
are generally present in excess, at Tm, 50% of the probes are occupied at equilibrium.)
Typically, stringent conditions will be those in which the salt concentration is at least
about 0.01 to 1.0 M Na ion concentration (or other salts) at pH 7.0 to 8.3 and the
temperature is at least about 30° C. for short probes (e.g., 10 to 50 nucleotides). Stringent
conditions may also be achieved with the addition of destabilizing agents such as
formamide.

The term "mismatch control" refers to a probe that has a sequence deliberately
selected not to be perfectly complementary to a particular target sequence. The mismatch
control typically has a corresponding test probe that is perfectly complementary to the
same particular target sequence. The mismatch may comprise one or more bases. While
the mismatch(es) may be located anywhere in the mismatch probe, terminal mismatches
are less desirable as a terminal mismatch is less likely to prevent hybridization of the
target sequence. In a particularly preferred embodiment, the mismatch is located at or
near the center of the probe such that the mismatch is most likely to destabilize the duplex
with the target sequence under the test hybridization conditions.

The terms "background" or "background signal intensity" refer to hybridization
signals resulting from non-specific binding, or other interactions, between the labeled
target nucleic acids and components of the oligonucleotide array (e.g., the oligonucleotide
probes, control probes, the array substrate, etc.). Background signals may also be
produced by intrinsic fluorescence of the array components themselves. A single
background signal can be calculated for the entire array, or a different background signal
may be calculated for each target nucleic acid. In a preferred embodiment, background is
calculated as the average hybridization signal intensity for the lowest 5% to 10% of the
probes in the array, or, where a different background signal is calculated for each target
gene, for the lowest 5% to 10% of the probes for each gene. Of course, one of skill in the
art will appreciate that where the probes to a particular gene hybridize well and thus
appear to be specifically binding to a target sequence, they should not be used in a
background signal calculation. Alternatively, background may be calculated as the
average hybridization signal intensity produced by hybridization to probes that are not
complementary to any sequence found in the sample (e.g., probes directed to nucleic acids
of the opposite sense or to genes not found in the sample such as bacterial genes where
the sample is mammalian nucleic acids). Background can also be calculated as the
average signal intensity produced by regions of the array that lack any probes at all.
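
By way of illustration only, the preferred background calculation (the average intensity of the lowest 5% to 10% of the probes) can be sketched as follows. The function name and the example intensities are hypothetical, and the rounding used to turn a fraction into a probe count is an assumption of this sketch.

```python
def background_from_lowest(intensities, fraction=0.05):
    """Average hybridization intensity of the lowest `fraction` of probes.

    intensities: hypothetical list of probe signal intensities for the whole
    array (or for the probes of a single gene).
    """
    ordered = sorted(intensities)
    # At least one probe is always used; the rounding rule is an assumption.
    n = max(1, round(fraction * len(ordered)))
    lowest = ordered[:n]
    return sum(lowest) / len(lowest)


# Made-up intensities 1.0 .. 100.0, for illustration only.
intensities = [float(v) for v in range(1, 101)]
bg_5_percent = background_from_lowest(intensities, 0.05)
bg_10_percent = background_from_lowest(intensities, 0.10)
```

Probes that appear to bind a target specifically would be excluded from the input list before the calculation, consistent with the caution given above.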

The term "quantifying" when used in the context of quantifying transcription
levels of a gene can refer to absolute or to relative quantification. Absolute quantification
may be accomplished by inclusion of known concentration(s) of one or more target
nucleic acids (e.g., control nucleic acids such as Bio B or with known amounts of the target
nucleic acids themselves) and referencing the hybridization intensity of unknowns with
the known target nucleic acids (e.g., through generation of a standard curve).
Alternatively, relative quantification can be accomplished by comparison of hybridization
signals between two or more genes, or between two or more treatments to quantify the
changes in hybridization intensity and, by implication, transcription level.
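
By way of illustration only, the two kinds of quantification can be sketched as follows. The sketch assumes a linear standard curve fitted by ordinary least squares, which is one possible choice and is not prescribed by this specification; all function names and numerical values are hypothetical.

```python
def fit_standard_curve(concentrations, signals):
    """Fit signal = slope * concentration + intercept by least squares."""
    n = len(concentrations)
    mean_c = sum(concentrations) / n
    mean_s = sum(signals) / n
    sxx = sum((c - mean_c) ** 2 for c in concentrations)
    sxy = sum((c - mean_c) * (s - mean_s)
              for c, s in zip(concentrations, signals))
    slope = sxy / sxx
    return slope, mean_s - slope * mean_c


def absolute_quantity(signal, slope, intercept):
    """Absolute quantification: read an unknown off the standard curve."""
    return (signal - intercept) / slope


def relative_change(treated_signal, control_signal):
    """Relative quantification: ratio of signals between two treatments."""
    return treated_signal / control_signal


# Made-up control concentrations and intensities, for illustration only.
slope, intercept = fit_standard_curve([1.0, 2.0, 4.0, 8.0],
                                      [110.0, 210.0, 410.0, 810.0])
unknown = absolute_quantity(510.0, slope, intercept)
fold = relative_change(300.0, 150.0)
```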

An object of the present invention is to use gene expression to predict whether a
compound has a high probability of being toxic at a given dose. In the system and
method of the present invention, patterns of gene expression are compared against known
"toxic" patterns and a similarity score calculated.

To accomplish those ends, the present invention provides a system and method for
identifying gene expression patterns associated with various modes of toxicity;
quantifying this association; developing a statistical inference of similarity; and validating
the results of the toxicity test.

It will be appreciated that there are preferred characteristics of the present
invention. These characteristics include time stability, dose dependence, vehicle
independence, predictability, and power of the analysis. Specifically, the analysis should
be time-stable in that it must be able to predict toxicity over an extended time range. In
addition, the analysis should be dose-dependent such that it will only score toxic doses of
compounds. Further, the analysis is preferably vehicle-independent, where it is not
sensitive to the type of vehicle used. The analysis is also predictable, where the resultant
statistical inference has a known false positive rate. Additionally, the analysis is powerful
so that false negative rates are low enough that singletons or a low number of replicates can
adequately predict toxicity.

Two models, acetaminophen (APAP) and CCl4, have been tested. With APAP,
the tissues were assayed at 3, 6, and 24 hours, at three dosages (V, L, H dose). With
CCl4, the tissues were assayed at 1, 3, 6, 24, and 72 hours, at two dosages (V, H dose). In
addition, various vehicle control samples were tested, including 74 samples of multiple
types of vehicles, including oil, gum, and saline, at time points of 0, 1, 3, 6, 24, 48, 72
hours, and 7 days. In addition, other toxins were assayed, including methotrexate,
thioacetamide, and CHCl3.

For CCl4, 147 patterns were observed, from which were selected 38 patterns with
816 genes, resulting in a prediction based on 4 principal components, with CCl4
considered toxic at all time points.

For APAP, 505 patterns were observed, from which were selected 28 patterns with
1024 genes, resulting in a prediction based on 8 principal components, with high doses of
APAP considered toxic at all time points.

For CCl4, there were 3 out of 74 (4.1%) false positives for all samples and 2 out of
53 (3.8%) for samples not in the model.

For APAP, there were 3 out of 74 (4.1%) false positives for all samples and 3 out of
44 (6.8%) for samples not in the model.
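
By way of illustration only, the false positive percentages quoted above follow directly from the stated counts, as the following arithmetic sketch shows. The function name is hypothetical.

```python
def false_positive_percent(false_positives, total_samples):
    """False positive rate expressed as a percentage."""
    return 100.0 * false_positives / total_samples


# Counts as stated in the text above.
ccl4_all = round(false_positive_percent(3, 74), 1)
ccl4_not_in_model = round(false_positive_percent(2, 53), 1)
apap_all = round(false_positive_percent(3, 74), 1)
apap_not_in_model = round(false_positive_percent(3, 44), 1)
```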

In addition, 703 genes specific to CCl4, 911 genes specific to
APAP and 113 genes in common were detected.

Figure 2 presents the principal component analysis of the CCl4 data.

Figure 3 presents the principal component analysis of the APAP data. 

Figure 4 presents the APAP predictive similarity model.
Figure 5 presents the CCl4 predictive similarity model.



It will be appreciated that the present invention can be carried out in multiple
stages. Specifically, in one preferred embodiment there are four stages of development:
selection, quantification, prediction, and validation. In the selection stage, relevant
expression patterns that are time stable and dose dependent are determined. In the
quantification stage, composite measures that define patterns are produced. In the
prediction stage, composite measures to assign probability of similarity of patterns are
generated. In the validation stage, statistical measures of model accuracy are provided.

The present invention, a method and system for expression similarity profiling for
predictive toxicology, employs a number of different methods for multivariate statistical
analysis. In a preferred embodiment, contrast analysis is employed in conjunction with an
analysis of variance (ANOVA) for each gene. In this methodology, as input, the average
difference for all samples and all genes is generated. Subsequently, an ANOVA analysis
is performed.

Analysis of variance (ANOVA) is used to test hypotheses about differences
between two or more means. The t-test based on the standard error of the difference
between two means can only be used to test differences between two means. When there
are more than two means, it is possible to compare each mean with each other mean using
t-tests. However, conducting multiple t-tests can lead to severe inflation of the
Type I error rate. Analysis of variance can be used to test differences among
several means for significance without increasing the Type I error rate.
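
By way of illustration only, the inflation of the Type I error rate can be approximated as follows. The sketch assumes that each pairwise t-test is performed at the same significance level and treats the tests as independent; pairwise comparisons of a common set of means are not strictly independent, so the result is an approximation. The function names are hypothetical.

```python
from math import comb


def pairwise_comparisons(n_means):
    """Number of distinct pairwise t-tests among n_means group means."""
    return comb(n_means, 2)


def familywise_error_rate(n_means, alpha=0.05):
    """Approximate chance of at least one false positive across all
    pairwise tests, assuming independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** pairwise_comparisons(n_means)


# Four means require six pairwise tests.
tests_for_four_means = pairwise_comparisons(4)
rate_for_four_means = familywise_error_rate(4)
```

Under these assumptions, comparing four means pairwise at the 0.05 level raises the overall chance of a false positive to roughly 26%.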

In a preferred embodiment of the present invention using ANOVA analysis, two
factors (time, dose) are fitted, using contrast analysis to assign each gene to a pattern. In
a particularly preferred embodiment of the present invention, the gene response is fitted to
one of a small number of useful patterns. In reality, there are many patterns that could
exhibit themselves. This potentially large number of patterns, however, is made up of
many simple patterns and only a small number of these patterns are useful in predicting
toxicity.

For example, suppose a single dose of a drug and a vehicle are administered at three time
points. Then, for each time point a gene would demonstrate a basic pattern of either
upregulated, downregulated, or not significantly changing. The number of patterns
produced would then be three for each time point, which would mean that 3 x 3 x 3 = 27
patterns can be produced. When we have multiple doses and a larger number of time
points, the number of patterns can be extensive. But only a small number of these
patterns are useful.
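
By way of illustration only, the pattern count in the example above can be enumerated as follows. The state labels and function name are hypothetical. The extension to multiple doses assumes that every dose and time combination independently takes one of the three states, which is an assumption of this sketch rather than a statement of the specification.

```python
from itertools import product

# The three basic responses of a gene at one time point.
STATES = ("up", "down", "unchanged")

# One dose compared with vehicle at three time points.
patterns = list(product(STATES, repeat=3))


def count_patterns(n_time_points, n_doses=1):
    """Number of possible patterns, assuming each dose/time combination
    independently takes one of the three states."""
    return len(STATES) ** (n_time_points * n_doses)


single_dose_count = count_patterns(3)
two_dose_five_time_count = count_patterns(5, n_doses=2)
```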

To be useful, a pattern must demonstrate time stability. In that regard, the change
in gene expression should go in the same direction for two or more time points and not
change direction in adjacent time points relative to the time points where gene expression
is changing.
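
By way of illustration only, one possible reading of the time stability requirement can be sketched as follows. The encoding of a pattern as a sequence of +1 (upregulated), -1 (downregulated) and 0 (not significantly changing), the function name, and the exact rule applied are all assumptions of this sketch.

```python
def is_time_stable(pattern):
    """Check a per-time-point pattern of +1 / -1 / 0 values.

    Assumed rule: the gene changes in the same direction at two or more
    time points, and no two adjacent time points change in opposite
    directions.
    """
    ups = sum(1 for s in pattern if s > 0)
    downs = sum(1 for s in pattern if s < 0)
    if max(ups, downs) < 2:
        return False
    return all(a * b >= 0 for a, b in zip(pattern, pattern[1:]))


stable_up = is_time_stable((1, 1, 0))
reversing = is_time_stable((1, -1, 1))
single_change = is_time_stable((0, 1, 0))
```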

In addition, a useful pattern will preferably demonstrate a dose dependence when
multiple doses are used, such as in the APAP model. At the high doses, the pattern must
increase or decrease relative to the vehicle and must also increase or decrease from
non-toxic doses of that substance in the same direction.

Further, for multiple doses, a general directionality preferably is demonstrated.
As the dose increases, the amount of change in gene expression is either increasing or
decreasing in the same direction. This can be characterized as a directionality of the
pattern in response to an increasing dose.
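
By way of illustration only, the directionality requirement can be sketched as a monotonicity check. The function name and the example values are hypothetical, and treating directionality as a non-strict monotonic trend is an assumption of this sketch.

```python
def has_dose_directionality(changes):
    """changes: expression change relative to vehicle, ordered by
    increasing dose. Returns True when the change moves consistently in
    one direction as the dose increases."""
    steps = [b - a for a, b in zip(changes, changes[1:])]
    return all(s >= 0 for s in steps) or all(s <= 0 for s in steps)


# Made-up changes for vehicle, low dose and high dose.
consistent = has_dose_directionality([0.0, 0.5, 2.0])
inconsistent = has_dose_directionality([0.0, 1.0, 0.4])
```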

Thus, the use of contrast analysis permits selection of only those patterns
that are useful with respect to time stability and dose dependence, with a level of
confidence in the result based on the appropriate statistical measure (ANOVA).

Upon the conclusion of the analysis, the output provides a list of patterns and a list
of genes within each pattern with measures of goodness of fit.

With regard to quantification of the toxicological response, principal component 
analysis (PCA) is employed. Here for input, genes are selected for patterns that are 
biologically relevant to the toxicological process. Then, PCA analysis is performed on all 
samples. The resultant output is 1 to 8 summary scores for each sample. 

In the subsequent step, as input, the 1 to 8 summary scores per sample are used as
indicators of the toxicity for each sample. In the analysis, a logistic regression analysis
maps scores on a 0 to 1 scale of toxicity. The resultant output is a mathematical formula
that converts a column of summary scores into a single 0 to 1 toxicological score for a
sample. With CCl4, there were 147 patterns generated. 38 patterns with 816 genes
were selected. Predictions were based on 4 principal components, with CCl4 considered
toxic at all time points. With APAP, there were 505 patterns generated. 28 patterns with
1024 genes were selected. This was resolved into 8 principal components, with APAP
high dose considered toxic at all time points.
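
By way of illustration only, the conversion of a column of summary scores into a single 0 to 1 score can be sketched with the standard logistic function. The function name, the intercept and the coefficients are hypothetical placeholders; fitted values are not given in this specification.

```python
import math


def toxicity_score(summary_scores, intercept, coefficients):
    """Map a sample's summary scores onto a 0 to 1 scale with the logistic
    function. intercept and coefficients stand in for fitted values."""
    z = intercept + sum(b * s for b, s in zip(coefficients, summary_scores))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients and scores, for illustration only.
neutral = toxicity_score([0.0, 0.0], 0.0, [1.0, 1.0])
example = toxicity_score([3.0, -1.0], -0.5, [1.2, 0.4])
```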





                                               CCl4           APAP
Percent False Positive (All Samples)           3/74 (4.1%)    3/74 (4.1%)
Percent False Positive (Samples not in Model)  2/53 (3.8%)    3/44 (6.8%)



The present invention will allow for the development of models for key
compounds; cross-validation of various toxicological models; discrimination of
false positive and false negative readings; reduction of toxicological models to a best set
of toxic markers; and prediction regarding the toxicity of unknown compounds.

The classification of objects into one or more groups based on many
measurements has several well established techniques. These include discriminant
analysis, logistic regression, multidimensional scaling, clustering, and neural networks.
A general discussion of each technique can be found in "Multivariate Analysis, Prentice
Hall ISBN 0-13-894858," which is incorporated herein by reference. All of these
methods work by making composite measures from the many measurements taken from
each object. With gene expression patterns we have several time and dose points which
represent multiple objects that are grouped together. None of these techniques is
sufficient alone to represent this order of complexity. Contrast analysis allows us to
identify measurements that are partially independent of time because they are time stable
yet are affected by toxic doses more than non-toxic doses. The PCA combines these
many measurements into a series of orthogonal composite measures. Since these
composite measures are non-correlated by definition, the problem of multicollinearity,
which can decrease the power of logistic regression, is eliminated. By combining these
techniques in the order described, many of the limitations of each individual technique are
reduced.

The following is a model developed from gene expression of rat livers using
Affymetrix RU35 Rat Chip data. The rats were either treated with a toxic dose, non-toxic
dose or vehicle controls. The raw expression data expressed as normalized average
differences were then entered into the model described here.

In achieving this analysis, a preferred expression similarity profiling for predictive
toxicology algorithm is employed. In this algorithm, let Xij represent gene expression
values for the i'th gene and j'th sample (i = 1 to I, j = 1 to J). Let Yj, Dj, and Tj represent
the indicator of toxicity for the j'th sample, the dose for the j'th sample, and the time for
the j'th sample, respectively. In the first step, time stable and dose dependent patterns are
selected. For gene i, fit a two-factor analysis of variance model. This model can be
expressed as

Xij=a+b*Dj+c*Tj+d*Dj*Tj, 
for the case of two dose groups (Dj = 0 or 1) and two time points (Tj = 0 or 1). In this 
model, the parameters (a, b, c, d) are estimated via a least squares algorithm. 
Accommodating additional time/dose levels is accomplished by adding additional model
parameters for each additional time and/or dose level. For example, the case of four time
points (Tj = 0 or 1 or 2 or 3) and three dose groups (Dj = 0 or 1 or 2) can be expressed as

Xij = a+b1*D1j+b2*D2j+c1*T1j+c2*T2j+c3*T3j+d1*D1j*T1j+d2*D1j*T2j+
d3*D1j*T3j+d4*D2j*T1j+d5*D2j*T2j+d6*D2j*T3j,

where T1j = 1 if Tj = 1, T2j = 1 if Tj = 2, etc. The parameters (a, b1, b2, c1, c2, c3, d1,
d2, d3, d4, d5, d6) are estimated as above.
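
By way of illustration only, the two-dose, two-time-point model for a single gene can be fitted as follows. Because that model has one parameter per dose and time cell, the least squares estimates reduce to differences of cell means, and the sketch relies on this. The function name and the expression values are hypothetical, and every cell is assumed to contain at least one sample.

```python
def fit_two_by_two(samples):
    """Least squares estimates (a, b, c, d) for
    X = a + b*D + c*T + d*D*T with D and T each 0 or 1.

    samples: iterable of (x, dose, time) tuples for one gene.
    """
    cells = {(d, t): [] for d in (0, 1) for t in (0, 1)}
    for x, dose, time in samples:
        cells[(dose, time)].append(x)
    mean = {key: sum(vals) / len(vals) for key, vals in cells.items()}
    a = mean[(0, 0)]                      # baseline: vehicle, first time
    b = mean[(1, 0)] - a                  # dose effect
    c = mean[(0, 1)] - a                  # time effect
    d = mean[(1, 1)] - mean[(1, 0)] - mean[(0, 1)] + a  # interaction
    return a, b, c, d


# Made-up expression values for one gene, two replicates per cell.
samples = [(2.1, 0, 0), (1.9, 0, 0), (5.0, 1, 0), (5.0, 1, 0),
           (2.0, 0, 1), (2.0, 0, 1), (5.4, 1, 1), (5.6, 1, 1)]
a, b, c, d = fit_two_by_two(samples)
```

In this made-up example the gene shows a dose effect (b) with essentially no time effect (c). Assessing the significance of each parameter would additionally require the residual variance, which this sketch does not compute.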

In the subsequent step, genes are categorized according to the magnitude, sign,
and significance level of the estimated parameters. Genes are selected for multivariate
statistical analysis of the algorithm if they exhibit dose effects (significant b1, b2, ...
parameters) without time effects (non-significant c1, c2, ... parameters).

In carrying out the multivariate statistical analysis, the multiple variables are
resolved into several components. For the reduced data matrix X'ij (i = genes selected
from step 1, j = 1 to J), a principal components analysis is performed. The result of this
analysis is a series of J principal components, and a score matrix S, where Sij represents
the value of the i'th principal component for the j'th sample.
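
By way of illustration only, the score matrix S can be computed with a singular value decomposition, as sketched below using NumPy. Centering each gene across samples before the decomposition, the function name, and the random example data are assumptions of this sketch.

```python
import numpy as np


def principal_component_scores(x_reduced, n_components):
    """Return S, where S[i, j] is the i'th principal component score of
    the j'th sample.

    x_reduced: genes x samples matrix (the reduced data matrix X').
    """
    x = np.asarray(x_reduced, dtype=float)
    # Center each gene across samples (an assumption of this sketch).
    centered = x - x.mean(axis=1, keepdims=True)
    # SVD of the samples x genes matrix; u * s gives per-sample scores.
    u, s, _ = np.linalg.svd(centered.T, full_matrices=False)
    scores = (u * s).T  # components x samples
    return scores[:n_components]


# Random stand-in data: 20 selected genes, 6 samples.
rng = np.random.default_rng(0)
x_reduced = rng.normal(size=(20, 6))
S = principal_component_scores(x_reduced, n_components=2)
```

The rows of S are mutually orthogonal, which is the property relied upon when the components are later entered into the logistic regression.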

In the next step, a step-up logistic regression procedure is employed, where
initially a model with one principal component is fit:

Log(Yj/(1-Yj)) = a+b1*S1j

The parameters a and b1 are estimated via maximum likelihood estimation. Additional
components are added into the model if the model fit would be improved.

This model is used to predict the probability of toxicity for each of the J samples.
If the probability for the known toxins is consistently high and the probability for the
known non-toxins is consistently low, then the model is accepted. Otherwise, the
gene selection criteria are altered and the multivariate statistical analysis is repeated.
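
By way of illustration only, a step-up logistic regression over the principal component scores can be sketched as follows using NumPy. The Newton-Raphson fitting routine, the small ridge term added for numerical stability, the log-likelihood gain threshold used to decide whether a further component improves the fit, and all names and data are assumptions of this sketch and are not specified herein.

```python
import numpy as np


def fit_logistic(features, y, n_iter=50, ridge=1e-3):
    """Newton-Raphson fit. Returns (coefficients, log-likelihood); the
    first coefficient is the intercept. The ridge term is an assumption
    added only to keep the linear solve well conditioned."""
    design = np.column_stack([np.ones(len(y)), features])
    beta = np.zeros(design.shape[1])
    penalty = ridge * np.eye(design.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-design @ beta))
        gradient = design.T @ (y - p) - ridge * beta
        hessian = design.T @ (design * (p * (1.0 - p))[:, None]) + penalty
        beta = beta + np.linalg.solve(hessian, gradient)
    p = np.clip(1.0 / (1.0 + np.exp(-design @ beta)), 1e-12, 1 - 1e-12)
    loglik = float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))
    return beta, loglik


def step_up(scores, y, min_gain=2.0):
    """scores: components x samples matrix S; y: 0/1 toxicity indicator.
    Components are added in order while the log-likelihood improves by at
    least min_gain (a hypothetical criterion)."""
    y = np.asarray(y, dtype=float)
    n_used = 1
    beta, best = fit_logistic(scores[:1].T, y)
    while n_used < scores.shape[0]:
        cand_beta, cand = fit_logistic(scores[:n_used + 1].T, y)
        if cand - best < min_gain:
            break
        beta, best, n_used = cand_beta, cand, n_used + 1
    return beta, n_used


# Made-up scores for 8 samples and 2 components, for illustration only.
scores = np.array([[-2.0, -1.0, -0.5, 0.5, -0.3, 0.3, 1.0, 2.0],
                   [0.1, -0.1, 0.1, -0.1, 0.1, -0.1, 0.1, -0.1]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
beta, n_used = step_up(scores, y)
```

The fitted coefficients can then be applied to each sample's scores to obtain a predicted probability of toxicity, which is compared against the known toxic and non-toxic labels as described above.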

The invention consists of three distinct stages. At each stage, small variations in
technique can be used to accomplish the same task. The first stage, selection of time
stable and dose dependent patterns by contrast analysis, can be altered by changing the
method of measuring variation. We use a method that is based on analysis of variance,
where the time component and dose component are assessed simultaneously. One could
use a series of t-tests on individual parts of the pattern to get a collective set of p values
that could approximate our method of measuring variation. One could also set an
arbitrary fractional cutoff, the mean or median of the experimental group divided by that
of the control group, to approximate the measurement of variation for each part of the
pattern that is then used in the next two stages of analysis. The novel feature is to find
time stable and dose dependent patterns with a predicted p value for that pattern.

WO 02/059560    PCT/US02/01633

The second stage, reduction of thousands of variables into one or more composite
variables, is accomplished by principal component analysis. Alternative methods exist to
produce a composite measure. Partial least squares can be used with control and
experimental group being assigned values as dependent variables. Factor analysis has also
been used in other settings to reduce many variables into one composite variable.

The third stage, use of composite variables to make one predictive composite
measure, is accomplished by entering the principal components, the composite measures
from PCA analysis, into a logistic regression. The dependent variable in a logistic
regression is the chance of a positive (toxic) or negative (non-toxic) outcome that is
bounded by the values 1 and 0, respectively. Discriminant analysis could also be used to
classify the samples as toxic or non-toxic, and the discriminant Z scores and distances
from the centroids of groups with respect to the Z score variations could be used as
an alternative method for creating a probability score.

Various preferred embodiments of the invention have been described in
fulfillment of the various objects of the invention. It should be recognized that these
embodiments are merely illustrative of the principles of the invention. Numerous
modifications and adaptations thereof will be readily apparent to those skilled in the art
without departing from the spirit and scope of the present invention.

1. A method for assessing toxicity and toxicology of a substance, comprising:

exposing a set of at least two genes to the substance; 

monitoring the response of each gene in the set of genes to the substance; 

analyzing the variance of the response to the substance for each gene using 
contrast analysis; 

constructing a summary score for each gene in the set of genes; 

performing a logistic regression analysis upon the summary scores; and 

using the results of the logistic regression analysis to provide a predictive model
regarding the toxicity and toxicology of the substance.